We consider the maximum entropy problems associated with Renyi Q-entropy, subject to two kinds of constraints on expected 
values. The constraints considered are a constraint on the standard expectation, and a constraint on the generalized expectation as 
encountered in nonextensive statistics. The optimum maximum entropy probability distributions, which can exhibit a power-law 
behaviour, are derived and characterized. 

The Renyi entropy of the optimum distributions can be viewed as a function of the constraint. This defines two families of entropy 
functionals in the space of possible expected values. General properties of these functionals, including nonnegativity, minimum, 
convexity, are documented. Their relationships as well as numerical aspects are also discussed. Finally, we work out some specific 
cases for the reference measure Q(x) and recover in a limit case some well-known entropies. 
1. Introduction 

Consider two univariate continuous probability distributions with densities P and Q with respect to the Lebesgue 
measure. The Renyi information divergence introduced in [32] has the form 

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log \int_{\mathcal{D}} P(x)^\alpha Q(x)^{1-\alpha}\,dx, \tag{1}$$

where $\alpha$ is a positive real and $\mathcal{D}$ the domain of definition of the integral. In the discrete case, the integral is 
replaced by a discrete sum that extends over a subset $\mathcal{D}$ of the integers. The opposite $H_Q(P) = -D_\alpha(P\|Q)$ of the Renyi 
information divergence can be viewed as a Renyi entropy relative to the reference measure $Q$, and can be called $Q$-entropy. By 
L'Hospital's rule, the Kullback divergence is recovered in the limit $\alpha \to 1$. 
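As a hedged illustration of definition (1) in the discrete case, the sketch below evaluates the Renyi divergence of two small probability vectors and compares it with the Kullback divergence as $\alpha$ approaches 1. The function names and the example vectors are hypothetical choices made for this illustration; they are not taken from the paper.

```python
import math

def renyi_divergence(p, q, alpha):
    """Discrete analogue of (1): log(sum_x p(x)^alpha q(x)^(1-alpha)) / (alpha - 1).

    Assumes q(x) > 0 on the whole support, alpha > 0 and alpha != 1.
    """
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)

def kullback_divergence(p, q):
    """Kullback divergence D(P||Q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]   # illustrative distribution P
    q = [0.4, 0.4, 0.2]   # illustrative reference measure Q
    for alpha in (0.5, 0.9, 0.999, 1.001, 2.0):
        print(alpha, renyi_divergence(p, q, alpha))
    print("Kullback limit:", kullback_divergence(p, q))
```

The values printed for $\alpha$ close to 1 should bracket the Kullback divergence, which is the limit stated above.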

Applications and areas of interest of Renyi entropy are plentiful: communication and coding theory [10], data mining, 
detection, segmentation, classification [29,5], hypothesis testing [23], characterization of signals and sequences 
[38,19], signal processing [5,3], image matching and registration [29,15]. The connection with the log-likelihood has been 
outlined in [33], which also defines a measure of the intrinsic shape of a distribution that can serve as a measure of 
tail heaviness [27]. Renyi entropies for large families of univariate and bivariate distributions are given in [25,26]. 
Divergence measures based on entropy functions can be used in the process of inference [12], and in clustering or partitioning 
problems [22,2,7]. 

Renyi entropy also plays a central role in the theory of multifractals, see the review [18] and [4]. In statistical physics, 
following Tsallis' proposal [34,35] of another entropy (which is simply related to Renyi entropy), there has been a 
strong interest in these alternative entropies and the development of a community in "nonextensive thermostatistics". 
Indeed, the associated maximum entropy distributions exhibit a power-law behaviour, with a remarkable agreement 
with experimental data, see for instance [6,35] and references therein. These optimum distributions, called Tsallis 
distributions, are similar to Generalized Pareto Distributions, which are also of high interest in other fields, namely 
reliability theory [1], climatology [24], radar imaging [21] or actuarial sciences [8]. 

Jaynes' maximum entropy principle [16,17] suggests that the least biased probability distribution that describes a 
partially-known system is the probability distribution with maximum entropy compatible with all the available prior 
information. When prior information is available in the form of constraints on expected values, the maximum entropy 
method amounts to minimizing the Kullback information divergence $D(P\|Q)$ (or equivalently maximizing Shannon 
$Q$-entropy) subject to normalization and to these observation constraints. In the case of a single constraint on the mean 
of the distribution, say $E_P[X] = m$, the minimum of Kullback information in the set of all probability distributions 
with expectation $m$ is of course a function of $m$, denoted $\mathcal{F}(m)$, as follows 

$$\mathcal{F}(m) = \begin{cases} \min_P D(P\|Q) \\ \text{s.t. } m = E_P[X] \\ \text{and } \int_{\mathcal{D}} P(x)\,dx = 1. \end{cases} \tag{2}$$

It is a 'contracted' version of Shannon $Q$-entropy and is called a level-1 entropy functional, or rate function, in the 
theory of large deviations, e.g. [11]. The maximum entropy method is a widely used and successful method, applied 
in a large variety of problems and contexts. 

We focus here on solutions and properties of maximum entropy problems analogous to (2) for the Renyi information 
divergence (1), and on the associated entropy functionals. The maximum Renyi-Tsallis entropy distribution, with its 
power-law behavior, is at the heart of nonextensive statistics, but has also been considered in [13,14]. In nonextensive 
statistics, one still considers the classical mean constraint, but also a 'generalized' $\alpha$-expectation constraint. This 
'generalized' $\alpha$-expectation is in fact the expectation with respect to the distribution 

$$P^*(x) = \frac{P(x)^\alpha Q(x)^{1-\alpha}}{\int_{\mathcal{D}} P(x)^\alpha Q(x)^{1-\alpha}\,dx}, \tag{3}$$

that is a weighted geometric mean of $P$ and $Q$. It is nothing else but the 'escort' or zooming distribution of nonextensive 
statistics [35] and multifractals. Of course, with $\alpha = 1$, the escort distribution $P^*$ reduces to $P$ and the generalized 
mean $E_{P^*}[X]$ reduces to the standard one. 

Therefore, the maximum entropy problems associated to the Renyi information divergence (1), subject to normalization 
and to a classical (C) or generalized (G) mean constraint, state as: 

$$\mathcal{F}_\alpha^{(C\ \text{resp.}\ G)}(m) = \begin{cases} \min_P D_\alpha(P\|Q) \\ \text{s.t. (C) } m = E_P[X] \\ \text{or (G) } m = E_{P^*}[X] \\ \text{and } \int_{\mathcal{D}} P(x)\,dx = 1, \end{cases} \tag{4}$$

where $\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$ are the level-one entropy functionals associated to Renyi $Q$-entropy for the classical and 
generalized constraints respectively. Since Renyi entropy reduces to Shannon's for $\alpha = 1$, functionals $\mathcal{F}_\alpha^{(\cdot)}(m)$ will 
reduce to $\mathcal{F}(m)$ when $\alpha \to 1$. 


Hence, in this paper, we consider the forms and properties of maximum entropy solutions associated to Renyi 
$Q$-entropy, subject to two kinds of constraints, as explained above. The values of the maximum entropy problems at the 
optimum define classes of entropy functionals $\mathcal{F}_\alpha^{(\cdot)}(m)$ associated to each choice of reference $Q$, and indexed by 
the parameter $\alpha$. The introduction of the reference measure $Q$, and therefore the definition of functionals $\mathcal{F}_\alpha^{(\cdot)}(m)$, 
is, to the best of our knowledge, new in this setting. In section 2, the exact form of the probability distributions $P$ 
that realize the minimum of the Renyi information divergence in the right side of (4) is first derived. Then we give 
some properties of these distributions and of their partition functions. We show that the entropy functionals $\mathcal{F}_\alpha^{(\cdot)}(m)$ 
are simply linked to these partition functions. General properties of the entropy functionals, including nonnegativity and 
convexity, are established. We also indicate how the problems (4) can be tackled numerically, for specific values of the 
constraints, even though the maximum entropy distributions exhibit implicit relationships. A divergence in the object 
space, that reduces to a Bregman divergence for $\alpha \to 1$, is defined. These results are illustrated in section 3 where we 
study four special cases of reference $Q$, and characterize the associated entropy functionals. It is then shown that some 
well-known entropies are recovered. 



2. The minimum of Renyi divergence 



Let us define by 

$$P_\nu(x) = \frac{[1+\gamma(x-\bar{x})]^\nu}{Z_\nu(\gamma,\bar{x})}\, Q(x) \tag{5}$$

a probability density function on a subset $\mathcal{D}$ of $\mathbb{R}$, where $\mathcal{D}$ ensures that the numerator of (5) is always nonnegative and 
its integral finite. The normalization $Z_\nu(\gamma,\bar{x})$ is the partition function defined by 

$$Z_\nu(\gamma,\bar{x}) = \int_{\mathcal{D}} [1+\gamma(x-\bar{x})]^\nu Q(x)\,dx. \tag{6}$$

The density $P_\nu$ depends on three parameters: the exponent $\nu$, which can be considered as a shape parameter, a scale 
parameter $\gamma$ and a location parameter $\bar{x}$. But these parameters can also be linked. For instance, $\bar{x}$ might be a function 
of $\nu$ and $\gamma$. When unambiguous, we may also denote by $E_\nu[X]$ the statistical mean with respect to $P_\nu(x)$. 

With these notations, we have the following result. 
Theorem 1 

(C) The distribution $P_C(x)$ in the family (5) with $\nu = \xi = \frac{1}{\alpha-1}$ and $\bar{x} = E_P[X] = E_\xi[X]$ has the minimum Renyi 
divergence to $Q$, 

$$D_\alpha(P\|Q) \ge D_\alpha(P_C\|Q), \tag{7}$$

for all probability distributions $P(x)$ absolutely continuous with respect to $P_C(x)$ with a given (classical) expectation 
$\bar{x}$. 

(G) The distribution $P_G(x)$ in the family (5) with $\nu = -\xi = \frac{1}{1-\alpha}$ and $\bar{x} = E_{P^*}[X] = E_{-(\xi+1)}[X]$ has the minimum 
Renyi divergence to $Q$, 

$$D_\alpha(P\|Q) \ge D_\alpha(P_G\|Q), \tag{8}$$

for all probability distributions $P(x)$ absolutely continuous with respect to $P_G(x)$ with a given generalized expectation 
$\bar{x}$. 

Corollary 2 The solution to the minimization of Renyi divergence in (4) is as given in theorem 1 for the particular 
values $\gamma^*$ of $\gamma$ such that $\bar{x} = m$. 

It is important to emphasize that $\bar{x}$ is here a statistical mean, and not the constraint $m$, and as such a function of $\gamma$. 
Proof. See Appendix A ■ 

Remark 3 When $\alpha$ tends to 1, $|\nu|$ tends to $+\infty$. Let us introduce $\tilde{\gamma}$ such that $\gamma = \tilde{\gamma}/\nu$. Then 

$$P_\nu(x) = e^{\nu \log\left(1+\frac{\tilde{\gamma}}{\nu}(x-\bar{x})\right) - \log Z_\nu(\gamma,\bar{x})}\, Q(x), \tag{9}$$

and 

$$\lim_{|\nu|\to+\infty} P_\nu(x) = e^{\tilde{\gamma}(x-\bar{x}) - \log Z_\infty(\tilde{\gamma},\bar{x})}\, Q(x), \tag{10}$$

that is the standard exponential, which is the well-known solution of the minimisation of Kullback-Leibler divergence 
subject to a constraint on an expected value [20, Theo 2.1, page 38]. In this case, the opposite of the log-partition function becomes 

$$\lim_{|\nu|\to+\infty} -\log Z_\nu(\gamma,\bar{x}) = \tilde{\gamma}\bar{x} - \log \int_{\mathcal{D}} e^{\tilde{\gamma}x} Q(x)\,dx. \tag{11}$$

Properties of the entropy functionals $\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$ are of course linked to the properties of the optimum 
distribution (5) and its partition function (6). In Property 4, we characterize partition functions of successive exponents, 
which enables us to derive the expression of the Renyi entropy associated to the optimum distribution. In Proposition 
6, we give the expression of the derivative of the partition function with respect to $\gamma$. Since the optimum distribution 
(5) is 'self-referential' (because it depends on its mean, which gives an implicit relation), direct determination of its 
parameters is difficult. It could rely on tabulation or on iterative techniques [36], which still suppose that the solution 
is an attractive fixed point. We define in Proposition 9 two functionals whose maximization provides the $\gamma$ parameter 
of the optimum distributions associated to the classical and generalized mean constraints. General properties of 
nonnegativity, minimum and convexity are then given in Proposition 11. We also show that the two entropy optimization 
problems are related and that functionals $\mathcal{F}_\alpha^{(\cdot)}(m)$ obey a special symmetry. Finally, we define a divergence in the 
space of possible means. 

Property 4 Partition functions of successive exponents are linked by 

$$Z_{\nu+1}(\gamma,\bar{x}) = E_{\nu+1-k}\left[(\gamma(x-\bar{x})+1)^k\right] Z_{\nu+1-k}(\gamma,\bar{x}). \tag{12}$$

An interesting particular case is for $k = 1$: 

$$Z_{\nu+1}(\gamma,\bar{x}) = E_\nu\left[\gamma(x-\bar{x})+1\right] Z_\nu(\gamma,\bar{x}). \tag{13}$$

This is easily checked by direct calculation. As a direct consequence, we may also observe that $Z_{\nu+1}(\gamma,\bar{x}) = Z_\nu(\gamma,\bar{x})$ 
if and only if $\bar{x} = E_\nu[x]$. When $\bar{x}$ is a fixed parameter $m$, this will be only true for a special value $\gamma^*$ such that 
$E_\nu[x] = m$. 
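Relations (12) and (13) can be checked numerically on a small discrete reference. The sketch below is only an illustration under stated assumptions: the four-point support, the weights `q`, and the values of `gamma` and `nu` are hypothetical, chosen so that $1+\gamma(x-\bar{x})$ stays positive on the support. It also locates by bisection a self-consistent location $\bar{x} = E_\nu[X]$, for which $Z_{\nu+1} = Z_\nu$.

```python
import math

def partition(nu, gamma, xbar, support, q):
    """Discrete analogue of (6): Z_nu = sum_x [1 + gamma (x - xbar)]^nu Q(x)."""
    return sum((1.0 + gamma * (x - xbar)) ** nu * qx for x, qx in zip(support, q))

def mean_under(nu, gamma, xbar, support, q, f=lambda x: x):
    """E_nu[f(X)] with respect to the distribution P_nu of (5)."""
    z = partition(nu, gamma, xbar, support, q)
    return sum(f(x) * (1.0 + gamma * (x - xbar)) ** nu * qx
               for x, qx in zip(support, q)) / z

def self_consistent_mean(nu, gamma, support, q, iters=200):
    """Bisection on xbar -> E_nu[X] - xbar, to find xbar such that xbar = E_nu[X]."""
    lo, hi = float(min(support)), float(max(support))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_under(nu, gamma, mid, support, q) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    support, q = [0, 1, 2, 3], [0.1, 0.2, 0.3, 0.4]   # hypothetical reference Q
    gamma, nu, xbar = 0.2, 1.7, 1.5
    lhs = partition(nu + 1.0, gamma, xbar, support, q)
    rhs = (mean_under(nu, gamma, xbar, support, q, lambda x: gamma * (x - xbar) + 1.0)
           * partition(nu, gamma, xbar, support, q))
    print("relation (13):", lhs, rhs)
    xs = self_consistent_mean(nu, gamma, support, q)
    print("xbar = E_nu[X] at", xs,
          partition(nu + 1.0, gamma, xs, support, q), partition(nu, gamma, xs, support, q))
```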

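Now, using (13) in Property 4, it is possible to give the expression of the Renyi divergence associated to the distribution 
(5) and in particular to the solutions $P_C$ and $P_G$ of problems (4): 

Property 5 The Renyi information divergence associated to the optimum distributions (5) in theorem 1 is 
(C) $D_\alpha(P\|Q) = -\log Z_\xi(\gamma,\bar{x}) = -\log Z_{\xi+1}(\gamma,\bar{x})$, and (G) $D_\alpha(P\|Q) = -\log Z_{-\xi}(\gamma,\bar{x}) = -\log Z_{-(\xi+1)}(\gamma,\bar{x})$. 

Proof. 

The Renyi divergence associated to (5) writes 

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\int P(x)^\alpha Q(x)^{1-\alpha}\,dx = \frac{1}{\alpha-1}\log\int (1+\gamma(x-\bar{x}))^{\alpha\nu} Q(x)\,dx - \frac{\alpha}{\alpha-1}\log Z_\nu(\gamma,\bar{x}),$$

that simply reduces to 

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log Z_{\alpha\nu}(\gamma,\bar{x}) - \frac{\alpha}{\alpha-1}\log Z_\nu(\gamma,\bar{x}).$$

(C) On the one hand, if $\nu = \xi = \frac{1}{\alpha-1}$, then $\alpha\nu = \frac{\alpha}{\alpha-1} = \xi+1$, and $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log Z_{\xi+1}(\gamma,\bar{x}) - \frac{\alpha}{\alpha-1}\log Z_\xi(\gamma,\bar{x})$. 
Therefore, when $\bar{x} = E_\xi[x]$, (13) gives $Z_{\xi+1}(\gamma,\bar{x}) = Z_\xi(\gamma,\bar{x})$, and it simply remains 

$$D_\alpha(P\|Q) = -\log Z_\xi(\gamma,\bar{x}) = -\log Z_{\xi+1}(\gamma,\bar{x}).$$

(G) On the other hand, if $\nu = -\xi = \frac{1}{1-\alpha}$, then $\alpha\nu = \frac{\alpha}{1-\alpha} = -\xi-1$, and $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log Z_{-(\xi+1)}(\gamma,\bar{x}) - \frac{\alpha}{\alpha-1}\log Z_{-\xi}(\gamma,\bar{x})$. 
When $\bar{x} = E_{-(\xi+1)}[x]$, we have $Z_{-\xi}(\gamma,\bar{x}) = Z_{-(\xi+1)}(\gamma,\bar{x})$ according to (13) and it remains 

$$D_\alpha(P\|Q) = -\log Z_{-\xi}(\gamma,\bar{x}) = -\log Z_{-(\xi+1)}(\gamma,\bar{x}).$$

■ 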

Since the Renyi information divergence of distributions (5) is simply the log-partition function, it will be useful to 
examine the behaviour of the partition function with respect to the parameter 7. Hence, the following proposition 
gives the expression of the derivative of the partition function. 
Proposition 6 For the partition function (6) with domain of definition $\mathcal{D}$, the derivative with respect to $\gamma$ of the 
partition function with characteristic exponent $\nu$ is given by 

$$\frac{d}{d\gamma} Z_\nu(\gamma,\bar{x}) = \nu\left(E_{\nu-1}[x-\bar{x}] - \gamma\frac{d\bar{x}}{d\gamma}\right) Z_{\nu-1}(\gamma,\bar{x}), \tag{14}$$

if (a) the domain $\mathcal{D}$ does not depend on $\gamma$, or (b) on subsets of $\gamma$ such that the domain increment $\delta\mathcal{D}$ associated to the 
variation remains empty, or (c) for $\nu > 0$ in the continuous case or $\nu > 1$ in the discrete case. 
Proof. See Appendix B ■ 

Using this proposition on the derivative of the partition function and Property 4 on the link between partition functions 
of successive exponents, we readily have 

Property 7 If $\bar{x} = E_{\nu-1}[X]$, then, with the same conditions as in Proposition 6: 

$$\frac{d}{d\gamma}\log Z_\nu(\gamma,\bar{x}) = -\nu\gamma\frac{d\bar{x}}{d\gamma}, \tag{15}$$

and 

$$\frac{d}{d\bar{x}}\log Z_\nu(\gamma,\bar{x}) = -\gamma\nu. \tag{16}$$

This is immediately checked using (13) and (14) with $\bar{x} = E_{\nu-1}[X]$. It is now interesting to consider the special case 
where $\bar{x}$ is a fixed value, say $m$. Then, it is immediate to check that the extrema of the function $\log Z_\nu(\gamma,m)$ occur for 
$\gamma^*$ such that $m = E_{\nu-1}[X]$. 
Property 8 If $\bar{x}$ is a fixed value $m$, then 

$$\left.\frac{d}{d\gamma}\log Z_\nu(\gamma,m)\right|_{\gamma=\gamma^*} = 0 \tag{17}$$

if and only if $\gamma^*$ is such that $m = E_{\nu-1}[X]$. 

This result is important because it provides an easy way to find the value of the parameter 7 of the optimum distribu- 
tions (5) that solves the maximum entropy problems (4). 

Proposition 9 The values $\gamma^*$ of the parameter $\gamma$ of the optimum distributions that solve the maximum entropy 
problems (4) are the maximizers (when several local maxima exist, the one with the smallest maximum) of 

$$D_C(\gamma) = -\log Z_{\xi+1}(\gamma,m) \tag{18}$$
$$D_G(\gamma) = -\log Z_{-\xi}(\gamma,m) \tag{19}$$

where the two partition functions involved are convex, possibly on several well-defined intervals. Then, the entropy 
functionals $\mathcal{F}_\alpha^{(\cdot)}$ are simply given by 

$$\mathcal{F}_\alpha^{(C\ \text{resp.}\ G)}(m) = D_{C\ \text{resp.}\ G}(\gamma^*). \tag{20}$$

Proof. Indeed, Theorem 1 and its corollary indicate that the solution for the classical constraint (C) is obtained for 
$\bar{x} = m = E_\xi[X]$, and for $\bar{x} = m = E_{-(\xi+1)}[X]$ for the generalized constraint (G). Then by Property 8 it suffices to 
look for the extrema of $D_C(\gamma) = -\log Z_{\xi+1}(\gamma,m)$ in the first case or of $D_G(\gamma) = -\log Z_{-\xi}(\gamma,m)$ in the second 
case. With similar conditions of derivation as in Proposition 6, the second derivative of the partition function with 
respect to $\gamma$ writes 

$$\frac{d^2 Z_\nu(\gamma,m)}{d\gamma^2} = \nu(\nu-1)\int_{\mathcal{D}} (x-m)^2\,[1+\gamma(x-m)]^{\nu-2}\, Q(x)\,dx \tag{21}$$

$$= \nu(\nu-1)\,E_{\nu-2}\left[(X-m)^2\right] Z_{\nu-2}(\gamma,m). \tag{22}$$

For $\nu = \xi+1$ and $\nu = -\xi$, the factor $\nu(\nu-1)$ reduces to $\frac{\alpha}{(\alpha-1)^2}$. Since $\alpha$ is positive, the second derivative is always 
positive and the partition functions $Z_{\xi+1}(\gamma,m)$ and $Z_{-\xi}(\gamma,m)$ are convex on their domain of definition. On these 
domains, the functionals in (18) and (19) are then unimodal and their extrema are maxima. 

In the discrete case and for $\nu < 0$, $Z_\nu(\gamma,m)$ has singularities for all $\gamma = \frac{1}{m-k}$, where $k$ is an integer in the support 
of the distribution. Therefore, $Z_\nu(\gamma,m)$ is only defined on segments $\left(\frac{1}{m-k}, \frac{1}{m-k-1}\right)$ for $m \notin (k,k+1)$, and 
$\left(\frac{1}{m-k-1}, \frac{1}{m-k}\right)$ for $m \in (k,k+1)$. In such a case, $-\log Z_\nu(\gamma,m)$ may present several maxima. The situation 
$\nu < 0$ occurs for the classical constraint when $\alpha \in (0,1)$ (since the index $\xi+1 = \alpha/(\alpha-1)$ is negative), and 
for the generalized constraint when $\alpha > 1$. An example of functional $D_C(\gamma)$ with $\alpha = 0.5$ in the case of a Poisson 
distribution is reported in Fig. 6. In the $\nu > 0$ discrete case or in the continuous case, there is a single maximum. 
Finally, since the expression of the Renyi information divergence of the optimum distributions is precisely the opposite 
of the log-partition function as indicated in Property 5, the value of functionals (18) and (19) at their optima $\gamma^*$ such 
that $\bar{x} = m$ is precisely the value of entropy functionals $\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$. ■ 
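Remark 10 When $\alpha$ tends to 1, the parameter $\tilde{\gamma}^*$ is thus the maximizer of (11), and we obtain 

$$\mathcal{F}_1(m) = \max_{\tilde{\gamma}} \left\{ \tilde{\gamma} m - \log \int_{\mathcal{D}} e^{\tilde{\gamma}x} Q(x)\,dx \right\}, \tag{23}$$

that is the Cramer transform of $Q(x)$. 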

With the help of these different results it is now possible to characterize more precisely the entropy functionals. 
Proposition 11 Entropy functionals $\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$ are nonnegative, with a unique minimum at $m_Q$, the 
mean of $Q$, and $\mathcal{F}_\alpha^{(\cdot)}(m_Q) = 0$. Furthermore, $\mathcal{F}_\alpha^{(C)}(m)$ is strictly convex for $\alpha \in [0,1]$. 

Proof. The Renyi information divergence $D_\alpha(P\|Q)$ is always nonnegative, and equal to zero for $P = Q$. Since functionals 
$\mathcal{F}_\alpha^{(\cdot)}(x)$ are defined as the minimum of $D_\alpha(P\|Q)$, they are always nonnegative. If $P = Q$, we have also $P^* = Q$ and 
$m = E_P[X] = E_{P^*}[X] = m_Q$. Therefore $\mathcal{F}_\alpha^{(\cdot)}(m_Q) = 0$ and $m_Q$ is a global minimum. 

From (16), we have $\frac{d}{d\bar{x}}\log Z_{\nu+1}(\gamma,\bar{x}) = -\gamma(\nu+1)$. Then, functionals $\mathcal{F}_\alpha^{(\cdot)}(x)$ are only minimum if $\gamma = 0$, and the 
corresponding optimum probability distributions are simply $P = Q$, and $D_\alpha(Q\|Q) = 0$. Therefore, $\mathcal{F}_\alpha^{(\cdot)}(x)$ have a 
unique minimum for $x = m_Q$, the mean of $Q$, and $\mathcal{F}_\alpha^{(\cdot)}(m_Q) = 0$. 
Finally, we examine the convexity of $\mathcal{F}_\alpha^{(C)}(m)$, for $\alpha \in [0,1]$. 

Let $P_1$ and $P_2$ be the distributions that achieve the minimization of $D_\alpha(P\|Q)$ subject to the constraints $x_1 = E_{P_1}[X]$ 
and $x_2 = E_{P_2}[X]$ respectively. Then, $\mathcal{F}_\alpha^{(C)}(x_1) = D_\alpha(P_1\|Q)$, and $\mathcal{F}_\alpha^{(C)}(x_2) = D_\alpha(P_2\|Q)$. In the same way, denote 
$\mathcal{F}_\alpha^{(C)}(\mu x_1 + (1-\mu)x_2) = D_\alpha(\hat{P}\|Q)$, where $\hat{P}$ denotes the optimum distribution with mean $\mu x_1 + (1-\mu)x_2$. 
Distributions $\hat{P}(u)$ and $\mu P_1(u) + (1-\mu)P_2(u)$ have the same mean $\mu x_1 + (1-\mu)x_2$, and $\hat{P}$ has the minimum divergence 
among distributions with this mean. Hence, when $D_\alpha(P\|Q)$ is a convex function of $P$, that is for $\alpha \in [0,1]$, we have 
$D_\alpha(\hat{P}\|Q) \le D_\alpha(\mu P_1 + (1-\mu)P_2\|Q) \le \mu D_\alpha(P_1\|Q) + (1-\mu) D_\alpha(P_2\|Q)$, that is 
$\mathcal{F}_\alpha^{(C)}(\mu x_1 + (1-\mu)x_2) \le \mu \mathcal{F}_\alpha^{(C)}(x_1) + (1-\mu)\mathcal{F}_\alpha^{(C)}(x_2)$, and $\mathcal{F}_\alpha^{(C)}(x)$ is a convex function. ■ 
Up to now the two optimization problems have been considered in parallel. But there is a special symmetry that enables 
us to relate the solutions of the minimization of Renyi divergence subject to classical and generalized mean constraints. 
Then, there exists a simple relationship between the entropy functionals $\mathcal{F}_\alpha^{(C)}(x)$ and $\mathcal{F}_\alpha^{(G)}(x)$. 
Let us consider our original Renyi divergence minimization problem, on one hand with index $\alpha_1$ and subject to a 
classical mean constraint $m$, and on the other hand with index $\alpha_2$ and subject to a generalized mean constraint $m$. 
The associated functionals, by Proposition 9, are $D_C(\gamma) = -\log Z_{\xi_1+1}(\gamma,m)$ and $D_G(\gamma) = -\log Z_{-\xi_2}(\gamma,m)$. Thus, 
we will have pointwise equality of these functions if $\xi_1 + 1 = -\xi_2$, that is if indexes $\alpha_1$ and $\alpha_2$ satisfy $\alpha_1 = 1/\alpha_2$. 
In this case, we will of course have equality of the optimum parameters $\gamma$, and the two optimization problems will 
have the same optimum value. Because of the pointwise equality of functions $D_G(\gamma)$ and $D_C(\gamma)$, it is clear that the 
associated divergences are equal at the optimum, that is $D_{\alpha_1}(P_C\|Q) = D_{\alpha_2}(P_G\|Q)$. Besides, this is easily checked 
in the general case: for the escort distribution $P^*(x)$ in (3), we always have the equality $D_{1/\alpha}(P^*\|Q) = D_\alpha(P\|Q)$. 
Hence, the minimization of the $\alpha$ Renyi divergence subject to the generalized mean constraint is exactly equivalent to 
the minimization of the $1/\alpha$ Renyi divergence subject to the classical mean constraint, 

$$\mathcal{F}_\alpha^{(G)}(m) = \mathcal{F}_{1/\alpha}^{(C)}(m), \tag{24}$$

so that generalized and classical mean constraints can always be swapped, provided the index $\alpha$ is changed into $1/\alpha$, 
as was argued in [31,28]. Hence, equality (24) enables us to complete the characterization of entropy functionals 
$\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$: 
Property 12 Entropy functionals $\mathcal{F}_\alpha^{(C)}(m)$ and $\mathcal{F}_\alpha^{(G)}(m)$ admit the symmetry $\mathcal{F}_\alpha^{(G)}(x) = \mathcal{F}_{1/\alpha}^{(C)}(x)$. Besides, $\mathcal{F}_\alpha^{(C)}(m)$ 
is strictly convex for $\alpha \in [0,1]$ and $\mathcal{F}_\alpha^{(G)}(m)$ is strictly convex for $\alpha \in [1,+\infty]$. 
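The identity $D_{1/\alpha}(P^*\|Q) = D_\alpha(P\|Q)$ behind this symmetry is easy to check numerically in the discrete case. The following is a minimal sketch, assuming two arbitrary three-point distributions chosen for illustration; the helper names `escort` and `renyi_divergence` are hypothetical and simply implement (3) and (1).

```python
import math

def renyi_divergence(p, q, alpha):
    """Discrete analogue of (1), for alpha > 0, alpha != 1 and q > 0."""
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)

def escort(p, q, alpha):
    """Escort distribution (3): normalized weighted geometric mean of P and Q."""
    weights = [pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p, support):
    """Standard expectation of the support values under p."""
    return sum(pi * x for pi, x in zip(p, support))

if __name__ == "__main__":
    support = [0, 1, 2]
    p = [0.2, 0.5, 0.3]   # illustrative P
    q = [0.4, 0.4, 0.2]   # illustrative reference Q
    for alpha in (0.3, 2.5):
        p_star = escort(p, q, alpha)
        # The generalized mean of P is the classical mean of its escort P*.
        print(alpha, mean(p_star, support),
              renyi_divergence(p, q, alpha),
              renyi_divergence(p_star, q, 1.0 / alpha))
```

For each $\alpha$, the last two printed values should coincide, which is the swap of constraints described above: a generalized mean constraint on $P$ with index $\alpha$ is a classical mean constraint on $P^*$ with index $1/\alpha$.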

Interestingly, it is also possible to define a divergence in the object space, that is a kind of generalized distance 
between two "objects". These divergences may be used for instance in clustering [30]. The objects are here considered 
as generalized means of distributions with minimum divergence to a reference measure Q(x). 

Proposition 13 If $P_1$ and $P_2$ are two distributions in (5) with exponent $\nu = -\xi$ (generalized constraint), with $P_2 \ll P_1$, 
and with respective parameters $\gamma_1$, $\gamma_2$ and means $m_1$, $m_2$, then 

$$\mathcal{F}_\alpha^{(G)}(m_2,m_1) = D_\alpha(P_2\|P_1) = \mathcal{F}_\alpha^{(G)}(m_2) - \mathcal{F}_\alpha^{(G)}(m_1) + \frac{1}{\alpha-1}\log\left[1-(\alpha-1)\frac{d\mathcal{F}_\alpha^{(G)}}{dm}(m_1)\,(m_2-m_1)\right], \tag{25}$$

and $\mathcal{F}_\alpha^{(G)}(m_2,m_1) \ge 0$, with equality if and only if $m_1 = m_2$. 
Proof. The result is obtained by simple computations. First, we have 

$$D_\alpha(P_2\|P_1) = \frac{1}{\alpha-1}\log\frac{\int_{\mathcal{D}} [1+\gamma_2(x-m_2)]^{-(\xi+1)}\,[1+\gamma_1(x-m_1)]\,Q(x)\,dx}{Z_{-\xi}(\gamma_2,m_2)^{\alpha}\, Z_{-\xi}(\gamma_1,m_1)^{1-\alpha}},$$

which can be rewritten as 

$$D_\alpha(P_2\|P_1) = \frac{1}{1-\alpha}\Big(\alpha\log Z_{-\xi}(\gamma_2,m_2) + (1-\alpha)\log Z_{-\xi}(\gamma_1,m_1) - \log Z_{-(\xi+1)}(\gamma_2,m_2)\Big) \tag{26}$$

$$\qquad + \frac{1}{\alpha-1}\log\frac{\int_{\mathcal{D}} [1+\gamma_1(x-m_1)]\,[1+\gamma_2(x-m_2)]^{-(\xi+1)}\,Q(x)\,dx}{Z_{-(\xi+1)}(\gamma_2,m_2)}. \tag{27}$$

In the first line, we have $Z_{-(\xi+1)}(\gamma_2,m_2) = Z_{-\xi}(\gamma_2,m_2)$ by Property 4, eq. (13), and we recognize from Proposition 
9 that $\mathcal{F}_\alpha^{(G)}(m) = -\log Z_{-\xi}(\gamma,m)$. In the second line, the ratio reduces to $1+\gamma_1(m_2-m_1)$ since $m_2$ is the generalized 
mean of the distribution $P_2$. Finally, $\gamma_1$ can be expressed as the derivative of the log-partition function as stated by 
(16) in Property 7. 

By definition, $\mathcal{F}_\alpha^{(G)}(m_2,m_1)$ is the Renyi information divergence $D_\alpha(P_2\|P_1)$, which is always greater than or equal to 
zero, with equality if and only if $P_2 = P_1$, which implies $m_2 = m_1$. ■ 
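The divergence between means can be checked numerically when the optimum distributions are known in closed form. The sketch below is a hedged illustration for a Bernoulli reference $Q(x) = \beta\delta(x) + (1-\beta)\delta(x-1)$: it assumes the closed-form expressions of the optimum probability $p$ at $x = 1$ and of $\mathcal{F}_\alpha^{(G)}(m)$ given in Section 3.3, approximates the derivative by a central difference (a convenience of this sketch, not the paper's method), and compares the right side of (25) with a direct evaluation of $D_\alpha(P_2\|P_1)$. All function names are hypothetical.

```python
import math

def p_from_generalized_mean(m, alpha, beta):
    """Mass at x = 1 of the two-point distribution whose generalized mean is m."""
    a = (beta ** (1.0 - alpha) * m) ** (1.0 / alpha)
    b = ((1.0 - beta) ** (1.0 - alpha) * (1.0 - m)) ** (1.0 / alpha)
    return a / (a + b)

def entropy_generalized(m, alpha, beta):
    """Closed-form F_alpha^(G)(m) for the Bernoulli reference (Section 3.3)."""
    a = (beta ** (1.0 - alpha) * m) ** (1.0 / alpha)
    b = ((1.0 - beta) ** (1.0 - alpha) * (1.0 - m)) ** (1.0 / alpha)
    return alpha / (1.0 - alpha) * math.log(a + b) - math.log(beta * (1.0 - beta))

def renyi_two_point(p2, p1, alpha):
    """D_alpha(P2||P1) for two distributions on {0, 1} with masses p2, p1 at x = 1."""
    total = (p2 ** alpha * p1 ** (1.0 - alpha)
             + (1.0 - p2) ** alpha * (1.0 - p1) ** (1.0 - alpha))
    return math.log(total) / (alpha - 1.0)

def object_divergence(m2, m1, alpha, beta, h=1e-6):
    """Right side of (25), with dF/dm approximated by a central difference."""
    slope = (entropy_generalized(m1 + h, alpha, beta)
             - entropy_generalized(m1 - h, alpha, beta)) / (2.0 * h)
    arg = 1.0 - (alpha - 1.0) * slope * (m2 - m1)
    return (entropy_generalized(m2, alpha, beta) - entropy_generalized(m1, alpha, beta)
            + math.log(arg) / (alpha - 1.0))

if __name__ == "__main__":
    alpha, beta, m1, m2 = 0.5, 0.5, 0.3, 0.6
    direct = renyi_two_point(p_from_generalized_mean(m2, alpha, beta),
                             p_from_generalized_mean(m1, alpha, beta), alpha)
    print(direct, object_divergence(m2, m1, alpha, beta))
```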

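For $\alpha \to 1$, $\mathcal{F}_\alpha^{(G)}(m_2,m_1)$ reduces to a standard Bregman divergence. Indeed, using $\log(1-x) \approx -x$, we have 
simply 

$$\lim_{\alpha\to 1} \mathcal{F}_\alpha^{(G)}(m_2,m_1) = \mathcal{F}_\alpha^{(G)}(m_2) - \mathcal{F}_\alpha^{(G)}(m_1) - \frac{d\mathcal{F}_\alpha^{(G)}}{dm}(m_1)\,(m_2-m_1).$$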



3. Examples of entropy functionals 



We now examine four special cases for the reference measure $Q(x)$: a uniform and an exponential distribution that model 
systems with continuous states; and then a Bernoulli (two-level) and a Poisson distribution which may model systems 
with discrete states. The minima of the Renyi divergence, that is the entropies $\mathcal{F}_\alpha^{(C\ \text{or}\ G)}(x)$, are attained for the values 
$\gamma^*$ that maximize the functionals $D_C(\gamma)$ and $D_G(\gamma)$ in Proposition 9. This involves the computation of $Z_\nu(\gamma,m)$ for 
all reference measures $Q$ considered, and the resolution of $\frac{d}{d\gamma}Z_{\nu+1}(\gamma,m) = 0$. The case $\alpha = 1$ is obtained in the limit 
$|\nu| \to +\infty$, since $|\xi| \to +\infty$ when $\alpha$ tends to 1. Results of numerical evaluations for varying $\alpha$ are provided. 



3.1. Uniform reference 



Let us first consider the case of the uniform reference $Q(x)$ on $[0,1]$. The partition function is given by 
$Z_\nu(\gamma,m) = \int_{\mathcal{D}} [\gamma(x-m)+1]^\nu\,dx$, where the domain $\mathcal{D}$ is defined by $\mathcal{D} = \mathcal{D}_Q \cap \mathcal{D}_\gamma$, with $\mathcal{D}_Q = \{x : x \in [0,1]\}$ and 
$\mathcal{D}_\gamma = \{x : \gamma(x-m)+1 \ge 0\}$. 
Fig. 1. Entropy functional $\mathcal{F}_\alpha^{(C)}(x)$ for a uniform reference measure and $\alpha \in (0,1)$. 

Fig. 2. Entropy functional $\mathcal{F}_\alpha^{(G)}(x)$ for a uniform reference measure and $\alpha \in (0,1)$. 


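Computation of the partition function in the different domains, together with the fact that $m \in [0,1]$, leads to 

$$Z_\nu(\gamma,m) = \frac{1}{\gamma(1+\nu)}\left[(\gamma-\gamma m+1)^{\nu+1}\,U\!\left(\gamma-\frac{1}{m-1}\right) - (-\gamma m+1)^{\nu+1}\,U\!\left(-\gamma+\frac{1}{m}\right)\right]$$

for all $\gamma$ if $\nu > 0$, for $\gamma \in \left[\frac{1}{m-1}, \frac{1}{m}\right]$ if $\nu < 0$, and $Z_\nu(\gamma,m) = +\infty$ otherwise, 
where $U$ denotes the Heaviside distribution: $U(t) = 0$ for $t < 0$ and $U(t) = 1$ for $t \ge 0$. 
The first derivative of the partition function is given by 

$$\frac{\partial Z_\nu}{\partial\gamma}(\gamma,m) = -\frac{\nu\gamma(m-1)+1}{\gamma^2(\nu+1)}\,(\gamma(1-m)+1)^{\nu}\,U\!\left(\gamma-\frac{1}{m-1}\right) + \frac{\nu\gamma m+1}{\gamma^2(\nu+1)}\,(1-\gamma m)^{\nu}\,U\!\left(-\gamma+\frac{1}{m}\right). \tag{28}$$
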

We next have to look for the expression of the entropy functionals $\mathcal{F}_\alpha^{(\cdot)}(x)$. Unfortunately, no analytical solution can be 
exhibited here, but the two functionals can still be evaluated numerically. For the classical mean constraint (C) we 
can check that $\mathcal{F}_\alpha^{(C)}(x)$ is a family of convex functions on $(0,1)$, minimum for the mean of the reference measure 
$Q$, as was indicated in Proposition 11. In the same way, we can check that for the generalized mean constraint (G) 
$\mathcal{F}_\alpha^{(G)}(x)$ is a family of nonnegative functions on $(0,1)$, also minimum for the mean of the reference measure $Q$. The 
entropies $\mathcal{F}_\alpha^{(C)}(x)$ and $\mathcal{F}_\alpha^{(G)}(x)$ were evaluated numerically and are given in Figs. 1 and 2 for $\alpha \in (0,1)$. Of course, 
the $\alpha \leftrightarrow 1/\alpha$ duality given in Property 12 enables us to extend these two functionals for $\alpha > 1$. 

Hence, it is apparent that the minimization of $\mathcal{F}_\alpha^{(\cdot)}(x)$ under some constraint would automatically lead to a solution 
on $(0,1)$. Moreover, the parameter $\alpha$ may serve to tune the curvature of the functional and the degree of penalization 
of the bounds. 
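The numerical evaluation mentioned above can be sketched as follows for the classical constraint: the closed-form partition function of the uniform reference is minimized over $\gamma$ on the interval $\left(\frac{1}{m-1}, \frac{1}{m}\right)$, where it is convex, and the functional is the opposite of the logarithm of this minimum, as in Proposition 9. This is only an illustration under stated assumptions: the ternary search, the guard near $\gamma = 0$ and the function names are choices made for this sketch, and the closed form used here excludes $\alpha = 1/2$ (exponent $-1$).

```python
import math

def partition_uniform(nu, gamma, m):
    """Closed-form Z_nu(gamma, m) for Q uniform on [0, 1].

    Valid for 1/(m-1) < gamma < 1/m and nu != -1.
    """
    if abs(gamma) < 1e-8:
        return 1.0  # limit gamma -> 0, where the closed form is 0/0
    upper = (1.0 + gamma * (1.0 - m)) ** (nu + 1.0)
    lower = (1.0 - gamma * m) ** (nu + 1.0)
    return (upper - lower) / (gamma * (nu + 1.0))

def entropy_classical_uniform(m, alpha, iters=200):
    """F_alpha^(C)(m) = max over gamma of -log Z_{xi+1}(gamma, m), for 0 < alpha < 1."""
    nu = alpha / (alpha - 1.0)             # exponent xi + 1 appearing in (18)
    lo, hi = 1.0 / (m - 1.0), 1.0 / m      # interval on which Z is finite and convex
    for _ in range(iters):                 # ternary search for the minimum of Z
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if partition_uniform(nu, a, m) < partition_uniform(nu, b, m):
            hi = b
        else:
            lo = a
    return -math.log(partition_uniform(nu, 0.5 * (lo + hi), m))

if __name__ == "__main__":
    for m in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(m, entropy_classical_uniform(m, alpha=0.25))
```

The printed values should be zero at $m = 1/2$, the mean of the uniform reference, and grow symmetrically as $m$ approaches the bounds.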



3.2. Exponential reference 

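The exponential probability density function is $Q(x) = \beta e^{-\beta x}$, for $x \ge 0$ and $\beta > 0$. The partition function is given 
by 

$$Z_\nu(\gamma,m) = \beta\int_{\mathcal{D}} [\gamma(x-m)+1]^\nu\, e^{-\beta x}\,dx \tag{29}$$

where $\mathcal{D} = \left\{x : x \ge \max\left(0, m-\frac{1}{\gamma}\right)\right\}$ if $\gamma > 0$, or $x \in \left[0, m-\frac{1}{\gamma}\right]$ if $\gamma < 0$, ensures that the factor $[\gamma(x-m)+1]$ 
is nonnegative and the integral finite. 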

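The evaluation of $Z_\nu(\gamma,m)$ on the different domains gives: 

$$Z_\nu(\gamma,m) = \begin{cases} \left(\frac{\gamma}{\beta}\right)^{\nu} e^{-\beta\frac{\gamma m-1}{\gamma}}\,\Gamma(\nu+1) & \text{if } \gamma > \frac{1}{m},\ \nu > 0, \\ \left(\frac{\gamma}{\beta}\right)^{\nu} e^{-\beta\frac{\gamma m-1}{\gamma}}\,\Gamma\!\left(\nu+1, \beta\frac{1-\gamma m}{\gamma}\right) & \text{if } \frac{1}{m} > \gamma > 0, \\ \left(\frac{\gamma}{\beta}\right)^{\nu} e^{-\beta\frac{\gamma m-1}{\gamma}}\left(\Gamma\!\left(\nu+1, \beta\frac{1-\gamma m}{\gamma}\right) - \Gamma(\nu+1)\right) & \text{if } \gamma < 0,\ \nu > 0, \end{cases} \tag{30}$$

and $Z_\nu(\gamma,m) = +\infty$ for $\gamma < 0$ or $\gamma > \frac{1}{m}$ if $\nu < 0$. 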

Let us now examine the behavior of the entropies $\mathcal{F}_\alpha^{(\cdot)}(x)$ when $\alpha \to 1$. This amounts to studying $Z_\nu(\gamma,m)$ and its 
extremum when $|\nu| \to +\infty$. 

The simplest derivation is as follows. As in Remark 3, let $\gamma = \tilde{\gamma}/\nu$, so that $(1+\gamma(x-m))^\nu \approx \exp(\tilde{\gamma}(x-m))$. In 
this case, one easily obtains that 

$$\log Z_\nu(\gamma,m) \approx \log\beta - \tilde{\gamma}m - \log(\beta-\tilde{\gamma}), \tag{31}$$

whose derivative is equal to zero for 

$$\tilde{\gamma}^* = \beta - \frac{1}{m}. \tag{32}$$

We shall also note that if $\nu < 0$, the sign of $\gamma = \tilde{\gamma}/\nu$ is the sign of $(1-\beta m)$. Since $Z_\nu(\gamma,m)$ is only defined for 
$\gamma > 0$ when $\nu < 0$, it means that we only have a solution for $m < 1/\beta$. Indeed, for $\gamma > 0$ and $\nu < 0$, the factor 
$(1+\gamma(x-m))^\nu$ is decreasing, and consequently the mean of the optimum distribution (5) cannot be greater than the 
mean of the reference distribution, $m_Q = E_Q[X] = 1/\beta$. 
With the optimum value $\tilde{\gamma}^*$, the log partition function becomes 

$$\log Z_\nu(\gamma^*,m) \approx -(\beta m-1) + \log(\beta m) \quad (\forall m \text{ if } \nu \to +\infty, \text{ for } m < 1/\beta \text{ if } \nu \to -\infty). \tag{33}$$

Finally, we thus obtain 

$$\mathcal{F}_{\alpha\to 1}^{(C)}(x) = -\log Z_{\xi+1}(\gamma^*,x) \approx (\beta x-1) - \log(\beta x), \tag{34}$$

for $x < 1/\beta$ when $\alpha$ tends to 1 by lower values, and for all $x$ if $\alpha$ tends to 1 by higher values. By the duality of Property 
12, this expression is also the limit form of functional $\mathcal{F}_\alpha^{(G)}(x)$. 

As was expected, the functional $(\beta x-1) - \log(\beta x)$ is strictly convex, nonnegative, and zero for $x = 1/\beta$, the mean of 
the exponential distribution. It was employed in speech processing and is called the Itakura-Saito entropy functional. 
For $\beta = 1$, it reduces to the so-called Burg entropy that is well-known in spectrum analysis. 
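The limit functional can be cross-checked numerically: by (31), it is the maximum over $\tilde{\gamma}$ of $\tilde{\gamma}m + \log\frac{\beta-\tilde{\gamma}}{\beta}$, that is the Cramer transform of the exponential reference. The sketch below, with hypothetical function names and a plain ternary search chosen for this illustration, compares this maximum with the closed form $(\beta m-1) - \log(\beta m)$.

```python
import math

def itakura_saito(x, beta):
    """Closed-form limit functional (beta x - 1) - log(beta x)."""
    return (beta * x - 1.0) - math.log(beta * x)

def cramer_exponential(m, beta, iters=200):
    """Maximise g(t) = t m - log(beta / (beta - t)) over t < beta (g is concave)."""
    def g(t):
        return t * m + math.log((beta - t) / beta)
    lo, hi = beta - 2.0 / m, beta      # the maximizer beta - 1/m lies in this interval
    for _ in range(iters):             # ternary search for the maximum of g
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if g(a) < g(b):
            lo = a
        else:
            hi = b
    return g(0.5 * (lo + hi))

if __name__ == "__main__":
    for m, beta in ((2.0, 1.0), (0.25, 3.0), (1.0, 1.0)):
        print(m, beta, cramer_exponential(m, beta), itakura_saito(m, beta))
```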

The entropy functionals can be evaluated numerically. For instance, $\mathcal{F}_\alpha^{(G)}(x)$ is given in Fig. 3 for $\alpha > 0$. It is a family 
of nonnegative functions, equal to zero for $x = m_Q = 1/\beta$, and convex for $\alpha \in [1,+\infty)$. 

3.3. Bernoulli reference 

Let us now consider the case of the Bernoulli measure $Q(x) = \beta\delta(x) + (1-\beta)\delta(x-1)$. Of course, the (generalized) 
mean of optimum distributions is somewhere in the interval $[0,1]$. When $\gamma$ is outside of the interval $\left(\frac{1}{m-1}, \frac{1}{m}\right)$, the 
probability distribution reduces to a pure state, $\delta(x)$ or $\delta(x-1)$, and its (generalized) mean is 0 or 1. Incorporation 
of the bounds into the domain depends on the sign of $\nu$: for $\nu < 0$, $Z_\nu(\gamma,m)$ diverges to $+\infty$ on the bounds whereas 
it remains finite for $\nu > 0$. The expression of the partition function follows directly from the definition: 

$$Z_\nu(\gamma,m) = \beta(1-\gamma m)^\nu + (1-\beta)(1+\gamma(1-m))^\nu. \tag{35}$$

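Fig. 3. Entropy functional $\mathcal{F}_\alpha^{(G)}(x)$ for an exponential reference measure with $\beta = 1$ and $\alpha > 0$. By Property 12 it is also $\mathcal{F}_{1/\alpha}^{(C)}(x)$. 

In contrast to the previous case, it is possible here to obtain an explicit expression of the entropy functionals for any 
$\alpha$. Indeed, if $p$ denotes the value of the optimum distribution at $x = 1$, then the generalized expectation is 

$$m = \frac{\sum_{x=0}^{1} x\,P(x)^\alpha Q(x)^{1-\alpha}}{\sum_{x=0}^{1} P(x)^\alpha Q(x)^{1-\alpha}} = \frac{(1-\beta)^{1-\alpha}p^\alpha}{\beta^{1-\alpha}(1-p)^\alpha + (1-\beta)^{1-\alpha}p^\alpha}, \tag{36}$$

and it is therefore possible to express $p$ as a function of $m$: 

$$p = \frac{\left(\beta^{1-\alpha}m\right)^{1/\alpha}}{\left(\beta^{1-\alpha}m\right)^{1/\alpha} + \left((1-\beta)^{1-\alpha}(1-m)\right)^{1/\alpha}}. \tag{37}$$

Now, since the Renyi information divergence is 

$$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\left[\beta^{1-\alpha}(1-p)^\alpha + (1-\beta)^{1-\alpha}p^\alpha\right], \tag{38}$$

it suffices to replace $p$ by the expression (37), which leads to 

$$\mathcal{F}_\alpha^{(G)}(m) = \frac{\alpha}{1-\alpha}\log\left[\left(\beta^{1-\alpha}m\right)^{1/\alpha} + \left((1-\beta)^{1-\alpha}(1-m)\right)^{1/\alpha}\right] - \log\left[\beta(1-\beta)\right]. \tag{39}$$
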

The case of the classical mean is even simpler: we have $m = p$, and $\mathcal{F}_\alpha^{(C)}(m)$ has the expression of the divergence 
in (38) with $p$ replaced by $m$. It is also interesting to note, and check, that the $\alpha \leftrightarrow 1/\alpha$ duality of Property 12 links 
these two expressions. 
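For the Bernoulli reference, the closed forms can be compared with the generic recipe of Proposition 9, which maximizes $-\log Z_{\xi+1}(\gamma,m)$ (classical constraint) or $-\log Z_{-\xi}(\gamma,m)$ (generalized constraint) over $\gamma$. The sketch below is a hedged illustration: it assumes the two-point partition function $\beta(1-\gamma m)^\nu + (1-\beta)(1+\gamma(1-m))^\nu$, writes the generalized functional as the substitution of $p(m)$ into the two-point divergence, and uses a ternary search over $\left(\frac{1}{m-1}, \frac{1}{m}\right)$ as the optimizer. Function names are hypothetical.

```python
import math

def partition_bernoulli(nu, gamma, m, beta):
    """Z_nu(gamma, m) for Q = beta*delta(x) + (1 - beta)*delta(x - 1)."""
    return beta * (1.0 - gamma * m) ** nu + (1.0 - beta) * (1.0 + gamma * (1.0 - m)) ** nu

def entropy_classical(m, alpha, beta):
    """Two-point Renyi divergence with the mass at x = 1 equal to m (classical mean)."""
    total = beta ** (1.0 - alpha) * (1.0 - m) ** alpha + (1.0 - beta) ** (1.0 - alpha) * m ** alpha
    return math.log(total) / (alpha - 1.0)

def entropy_generalized(m, alpha, beta):
    """Closed form obtained by substituting p(m) into the two-point divergence."""
    a = (beta ** (1.0 - alpha) * m) ** (1.0 / alpha)
    b = ((1.0 - beta) ** (1.0 - alpha) * (1.0 - m)) ** (1.0 / alpha)
    return alpha / (1.0 - alpha) * math.log(a + b) - math.log(beta * (1.0 - beta))

def max_neg_log_partition(nu, m, beta, iters=200):
    """Maximise -log Z_nu(gamma, m) over 1/(m-1) < gamma < 1/m by ternary search on Z."""
    lo, hi = 1.0 / (m - 1.0), 1.0 / m
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if partition_bernoulli(nu, a, m, beta) < partition_bernoulli(nu, b, m, beta):
            hi = b
        else:
            lo = a
    return -math.log(partition_bernoulli(nu, 0.5 * (lo + hi), m, beta))

if __name__ == "__main__":
    alpha, beta, m = 0.5, 0.5, 0.3
    xi = 1.0 / (alpha - 1.0)
    print("classical  :", max_neg_log_partition(xi + 1.0, m, beta), entropy_classical(m, alpha, beta))
    print("generalized:", max_neg_log_partition(-xi, m, beta), entropy_generalized(m, alpha, beta))
    print("duality    :", entropy_generalized(m, alpha, beta), entropy_classical(m, 1.0 / alpha, beta))
```

Each printed pair should agree, which checks both the optimization route and the $\alpha \leftrightarrow 1/\alpha$ duality on this example.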

The limit case $\alpha \to 1$ is easily derived using L'Hospital's rule. It comes 

$$\mathcal{F}_1(x) = x\log\left(\frac{x}{1-\beta}\right) + (1-x)\log\left(\frac{1-x}{\beta}\right). \tag{40}$$

This expression is the celebrated Fermi-Dirac entropy that is strictly convex, nonnegative, and equal to zero for 
$x = E_Q[X] = 1-\beta$, the mean $m_Q$ of the reference measure. 

Plots of the entropy functionals are given in Figs. 4 and 5 for $\alpha \in (0,1)$ and $\beta = 1/2$. In both cases, we have a family 
of nonnegative functions, equal to zero for the mean of the reference measure. It can also be checked that $\mathcal{F}_\alpha^{(C)}(x)$ is 
convex for $\alpha \in (0,1]$. 



3.4. Poisson reference 

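As a final example, let us consider the case of a Poisson measure $Q(x) = \frac{\mu^x}{x!}e^{-\mu}$, for $x \ge 0$. The domain is $\mathcal{D} = \mathcal{D}_Q \cap \mathcal{D}_\gamma$, 
where $\mathcal{D}_Q = \mathbb{N}_+$ (the nonnegative integers) and $\mathcal{D}_\gamma = \{x : \gamma(x-m)+1 \ge 0\}$. The partition function is given by 

$$Z_\nu(\gamma,m) = \sum_{x\in\mathcal{D}} [\gamma(x-m)+1]^\nu\,\frac{\mu^x}{x!}\,e^{-\mu}. \tag{41}$$
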

Three cases appear, according to the value of $\gamma$: 
(a) if $\frac{1}{m} > \gamma > 0$, then $\mathcal{D}$ reduces to $\mathcal{D}_1 = \{x : x \in [0,+\infty)\}$; 

(b) for $\gamma > \frac{1}{m}$ the domain is $\mathcal{D}_2 = \left\{x : x \in \left[\left\lceil m-\frac{1}{\gamma}\right\rceil, +\infty\right)\right\}$; 

(c) when $\gamma < 0$, $\mathcal{D} = \mathcal{D}_3 = \left\{x \in \left[0, \left\lfloor m-\frac{1}{\gamma}\right\rfloor\right]\right\}$. 

In these expressions $\lfloor x \rfloor$ denotes the floor function that returns the largest integer less than or equal to $x$; and $\lceil x \rceil$ is 
the ceil function, the smallest integer not less than $x$. 

Fig. 4. Entropy functional $\mathcal{F}_\alpha^{(C)}(x)$ for a Bernoulli reference measure and $\alpha \in (0,1)$. 

Fig. 5. Entropy functional $\mathcal{F}_\alpha^{(G)}(x)$ for a Bernoulli reference measure and $\alpha \in (0,1)$. 

Closed-form formulas cannot be derived in the general case, but only in the case of an integer exponent $\nu$. When $\nu$ is 
not an integer, we have to resort to the series (41), possibly truncated for numerical computations. In order to save 
space, we only sketch the derivation in $\mathcal{D}_1$. 



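$$Z_\nu(\gamma,m) = (1-\gamma m)^\nu\, e^{-\mu} \sum_{x=0}^{+\infty} (\theta x+1)^\nu\,\frac{\mu^x}{x!} \tag{42}$$

with $\theta = \frac{\gamma}{1-\gamma m}$. In the series above the ratio of successive terms, $\frac{(\theta(x+1)+1)^\nu}{(\theta x+1)^\nu}\frac{\mu}{x+1}$, is the ratio of two completely factored 
polynomials. This indicates that the series can be written as a generalized hypergeometric function, when $\nu$ is an integer. 
So doing, we obtain 

$$Z_\nu(\gamma,m) = (1-\gamma m)^\nu\, e^{-\mu}\;{}_{|\nu|}F_{|\nu|}(a,\ldots,a;\, b,\ldots,b;\, \mu)$$

with $a = (1+\theta)/\theta$ and $b = 1/\theta$ for $\nu > 0$; or with $a = 1/\theta$ and $b = (1+\theta)/\theta$ for $\nu < 0$. 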
The derivative with respect to $\gamma$ is 

$$\frac{d}{d\gamma}Z_{\nu+1}(\gamma,m) = (1-\gamma m)^\nu\, e^{-\mu}\,(\nu+1)\sum_{x=0}^{+\infty} (x-m)(1+\theta x)^\nu\,\frac{\mu^x}{x!}, \tag{43}$$

that can also be expressed using hypergeometric functions. Formulas for domains $\mathcal{D}_2$ and $\mathcal{D}_3$ also involve hypergeometric 
functions. With these formulas, or by direct evaluation of (41), functionals $D_C(\gamma)$ and $D_G(\gamma)$ can be evaluated 
and maximized on their domains of definition so as to find the optimum value $\gamma^*$. 
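The direct evaluation of the series can be sketched as follows for case (a), where the support is the whole set of nonnegative integers. This is a minimal illustration under stated assumptions: the truncation at 200 terms, the function names and the parameter values are choices made here, and the factored form simply pulls $(1-\gamma m)^\nu$ out of the sum with $\theta = \gamma/(1-\gamma m)$.

```python
import math

def partition_poisson_series(nu, gamma, m, mu, terms=200):
    """Truncated Poisson partition series in case (a): 0 < gamma < 1/m."""
    total, weight = 0.0, math.exp(-mu)     # weight = mu^x / x! * exp(-mu)
    for x in range(terms):
        total += (gamma * (x - m) + 1.0) ** nu * weight
        weight *= mu / (x + 1)
    return total

def partition_poisson_factored(nu, gamma, m, mu, terms=200):
    """Same quantity with (1 - gamma m)^nu factored out, theta = gamma / (1 - gamma m)."""
    theta = gamma / (1.0 - gamma * m)
    total, term = 0.0, 1.0                 # term = mu^x / x!
    for x in range(terms):
        total += (theta * x + 1.0) ** nu * term
        term *= mu / (x + 1)
    return (1.0 - gamma * m) ** nu * math.exp(-mu) * total

if __name__ == "__main__":
    mu, m, gamma = 3.0, 4.0, 0.1           # hypothetical values with 0 < gamma < 1/m
    for nu in (-1.5, 0.0, 1.0, 2.5):
        print(nu, partition_poisson_series(nu, gamma, m, mu),
              partition_poisson_factored(nu, gamma, m, mu))
```

For $\nu = 1$ the series is the mean of $\gamma(X-m)+1$ under the Poisson reference, that is $1+\gamma(\mu-m)$, which gives a simple sanity check of the truncation.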

Given the signs of $\nu$ and $\gamma$, and the supports $\mathcal{D}_1$, $\mathcal{D}_2$ and $\mathcal{D}_3$, it is already possible to deduce that the solution $\gamma^*$ is necessarily in a specific interval. Hence, we obtain here that for $\nu > 0$ (respectively for $\nu < 0$), solutions associated with a constraint $m > \mu$ correspond to case (a) (resp. case (c)) and that solutions for $m < \mu$ correspond to case (c) (resp. case (a)). The argument relies on the fact that if $P_i$ and $P_j$ are two optimum distributions with supports $\mathcal{D}_i$ and $\mathcal{D}_j$, with the same (generalized) mean but different parameters, then by Theorem 1 $D_\alpha(P_j \| Q) \geq D_\alpha(P_i \| Q)$ if $P_j$ is dominated by $P_i$.
In the case $m > \mu$, $\nu < 0$, the solution with minimum divergence is for a distribution $P_3$ in case (c), and furthermore we have $D_\alpha(P_3 \| Q) \to 0$. This can be seen as follows. Let $x \in \mathcal{D}_3$ and $k = \lfloor m - \frac{1}{\gamma} \rfloor$, so that $x < k + 1$. Let now



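$$\gamma = \frac{1}{m - k} + \varepsilon \quad \text{with} \quad \varepsilon \in \left[0,\ \frac{1}{m - k - 1} - \frac{1}{m - k}\right].$$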



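Then the mean of the distribution is given by

$$E_{P_3}[X] = \sum_{x=0}^{k-1} x\, \frac{\left[\gamma(x - m) + 1\right]^{\nu}}{Z_\nu(\gamma, m)}\, Q(x) + k\, \frac{[k - m]^{\nu}}{Z_\nu(\gamma, m)}\, \varepsilon^{\nu} Q(k), \tag{44}$$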



and any value higher than $\mu = E_Q[X]$ can be obtained by tuning $\varepsilon$, for many values of $k$. When $k$ increases, $\gamma \approx -\frac{1}{k - m}$ tends to 0 from below and $P_3$ tends to $Q$, which results in $D_\alpha(P_3 \| Q) \to 0$.

The $\nu < 0$ case has the specificity that $Z_\nu(\gamma, m)$ exhibits singularities at $\gamma = \frac{1}{m - k}$ for all $k \geq 0$. Then $Z_\nu(\gamma, m)$, with $\nu = -(\xi + 1)$ or $\nu = \xi$, is only convex on intervals $\left[\frac{1}{m - k}, \frac{1}{m - k - 1}\right]$ or $\left[\frac{1}{m - k - 1}, \frac{1}{m - k}\right]$ (for $k + 1 > m > k$), with $Z_\nu(\gamma, m) = +\infty$ on the bounds of each interval. Consequently, $-\log Z_\nu(\gamma, m)$ may present several maxima. This is illustrated in Fig. 6, where the function $D_C(\gamma)$ with $\alpha = 0.5$ presents many extrema. The solution with minimum Renyi divergence corresponds to the minimum of these maxima.
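The location of these singularities is easy to reproduce numerically. The sketch below assumes a Poisson reference and sums only over the integers where $\gamma(x - m) + 1 > 0$; the function names are hypothetical, and the parameter values are those quoted for Fig. 6 ($\mu = 3$, $\nu = -2$, $m = 1.15$).

```python
import math

def partition_on_support(nu, gamma, m, mu, n_terms=200):
    """Z_nu(gamma, m) summed over integers x >= 0 with gamma*(x - m) + 1 > 0."""
    total, weight = 0.0, math.exp(-mu)
    for x in range(n_terms):
        base = gamma * (x - m) + 1.0
        if base > 0.0:                      # x belongs to the support
            total += base ** nu * weight
        weight *= mu / (x + 1)              # Poisson weight at x + 1
    return total

def singular_gammas(m, k_max):
    """Values gamma = 1/(m - k), k = 0..k_max, where one term of Z_nu diverges (nu < 0)."""
    return [1.0 / (m - k) for k in range(k_max + 1) if k != m]
```

Evaluating `partition_on_support(-2, g, 1.15, 3.0)` for `g` approaching `1 / 1.15` from below shows the blow-up of the $x = 0$ term, whereas the value at `g = 0.5` stays moderate.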

The limit case $\alpha \to 1$ is obtained with $|\nu| = |\xi| \to +\infty$. According to the discussion above, the optimum $\gamma$ corresponds to case (a) for $\{m > \mu, \nu > 0\}$ and $\{m < \mu, \nu < 0\}$, and to case (c) for $\{m < \mu, \nu > 0\}$. For case (a), the support is $\mathcal{D}_1$, and the derivative of the partition function $Z_{\nu+1}(\gamma, m)$ is given by (43). In this derivative, the sum can be rewritten as

$$\sum_{x=0}^{+\infty} (x - m)(1 + \theta x)^{\nu} \frac{\mu^x}{x!} = \sum_{x=0}^{+\infty} \left(\mu (1 + \theta x + \theta)^{\nu} - m (1 + \theta x)^{\nu}\right) \frac{\mu^x}{x!}, \tag{45}$$



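so that $Z_{\nu+1}(\gamma, m)$ is minimum when the RHS of (45) is equal to zero. We have to solve this equation in $\theta$. Suppose that $\theta$ is small and that $\theta x \ll 1$ for the significant values of the probability distribution. In this case, we use the approximation $(1 + \theta x)^{\nu} = e^{\nu \log(1 + \theta x)} \approx e^{\nu \theta x}$, that leads to

$$\sum_{x=0}^{+\infty} m\, e^{\nu \theta x}\, \frac{\mu^x}{x!} \left[\frac{\mu}{m} e^{\nu \theta} - 1\right] = 0. \tag{46}$$

The solution is given by $\theta^* = \frac{1}{\nu} \log\left(\frac{m}{\mu}\right)$, that in turn provides

$$\gamma^* = \frac{\frac{1}{\nu} \ln \frac{m}{\mu}}{1 + \frac{m}{\nu} \ln \frac{m}{\mu}}. \tag{47}$$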



In case (a), $\gamma$ is positive, and this will be true for $\gamma^*$ if $\{m > \mu, \nu > 0\}$ or $\{m < \mu, \nu < 0\}$. For the log-partition function, when $|\nu| \to +\infty$, this leads to

$$-\log Z_{\nu+1}(\gamma^*, m) \approx m \log \frac{m}{\mu} + (\mu - m).$$
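This limit can be checked numerically for a large exponent. The sketch below is a rough illustration, assuming a Poisson reference and the closed form $\gamma^* = \theta^*/(1 + m\theta^*)$ with $\theta^* = \frac{1}{\nu}\log\frac{m}{\mu}$, obtained by inverting $\theta = \gamma/(1 - \gamma m)$. The function names are hypothetical, and the sum is accumulated in the log domain to avoid overflow when $|\nu|$ is large.

```python
import math

def gamma_star(nu, m, mu):
    """Approximate optimum gamma for large |nu|, from theta* = log(m/mu)/nu."""
    theta = math.log(m / mu) / nu
    return theta / (1.0 + m * theta)

def neg_log_partition(exponent, gamma, m, mu, n_terms=400):
    """-log sum_x [gamma (x - m) + 1]^exponent Poisson(mu)(x), via log-sum-exp."""
    logs = []
    for x in range(n_terms):
        base = gamma * (x - m) + 1.0
        if base <= 0.0:
            continue                        # x outside the support
        log_weight = -mu + x * math.log(mu) - math.lgamma(x + 1)
        logs.append(exponent * math.log(base) + log_weight)
    top = max(logs)
    return -(top + math.log(sum(math.exp(v - top) for v in logs)))

def kl_limit(m, mu):
    """Limit value m log(m/mu) + (mu - m)."""
    return m * math.log(m / mu) + (mu - m)
```

For example, with `nu = 5000`, `m = 4.0` and `mu = 3.0`, the value `neg_log_partition(nu + 1, gamma_star(nu, m, mu), m, mu)` is close to `kl_limit(m, mu)`.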

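In domain $\mathcal{D}_3$, the derivative of the partition function $Z_\nu(\gamma, m)$ is equal to zero if

$$\sum_{x=0}^{k} (x - m)(1 + \theta x)^{\nu} \frac{\mu^x}{x!} = 0, \quad \text{with } k = \left\lfloor m - \frac{1}{\gamma} \right\rfloor,\ \gamma < 0. \tag{48}$$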



If $\gamma$ is small enough, $k \to +\infty$ and we obtain for $\nu > 0$ the same formulation and solution as in $\mathcal{D}_1$. The solution $\gamma^*$ in (47) is now negative, which imposes $m < \mu$ for $\nu > 0$. Finally, we have shown above that if $m > \mu$ with $\nu < 0$ then

$$D_\alpha(P_3 \| Q) \to 0.$$

Hence, we obtain that the entropy functionals converge to

$$\mathcal{F}_{\alpha \to 1}(x) = x \ln \frac{x}{\mu} + (\mu - x) \tag{49}$$

with the restriction that $\mathcal{F}_\alpha(x) = 0$ for $x > \mu$ if (C) $\alpha < 1$ or (G) $\alpha > 1$.

This functional is simply the cross-entropy between $x$ and $\mu$, or Kullback-Leibler (Shannon) entropy functional with respect to $\mu$ [9]. It measures a 'distance' between a possible mean (observable) and a reference mean $\mu$, and it has been used as a regularization functional in several applied problems, such as astronomy, tomography, NMR, and spectrometry.

Fig. 6. Example of functional $D_C(\gamma)$ for the Poisson reference with classical mean constraint, with $\mu = 3$, $\alpha = 0.5$ ($\nu = -2$) and $m = 1.15$. It presents singularities at $1/(m - k)$, $\forall k$, and maxima at $\gamma = 0.35$ and $\gamma = 1.24$.

Fig. 7. Entropy functional $\mathcal{F}_\alpha^{(G)}(x)$ for a Poisson reference measure with $\mu = 3$ and $\alpha > 0$. For $\alpha \to 1$, $\mathcal{F}_\alpha^{(G)}(x)$ converges to $x \ln \frac{x}{\mu} + (\mu - x)$.
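The limiting functional $x \ln \frac{x}{\mu} + (\mu - x)$ is straightforward to evaluate. The short sketch below uses the hypothetical name `kl_functional` and the usual convention $0 \ln 0 = 0$, which is an assumption made here for the boundary point $x = 0$. It illustrates that the functional is nonnegative and vanishes at $x = \mu$.

```python
import math

def kl_functional(x, mu):
    """Limit functional x ln(x/mu) + (mu - x), with the convention 0 ln 0 = 0."""
    if x < 0 or mu <= 0:
        raise ValueError("requires x >= 0 and mu > 0")
    if x == 0:
        return mu                           # x ln(x/mu) -> 0 as x -> 0
    return x * math.log(x / mu) + (mu - x)
```

With $\mu = 3$, the value is zero at $x = 3$ and positive on both sides, and the midpoint inequality between $x = 1$ and $x = 3$ is consistent with convexity.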

As in the previous cases, the entropy functionals $\mathcal{F}_\alpha^{(C)}(x)$ and $\mathcal{F}_\alpha^{(G)}(x)$ can be evaluated numerically. For instance, $\mathcal{F}_\alpha^{(G)}(x)$ is given in Fig. 7 for $\mu = 3$. It presents a unique minimum for $m = \mu$, and we note that it is not convex for small values of $\alpha$.

4. Conclusion and future work 

By weakening one of the postulates that lead to the definition of Shannon entropy, Renyi [32] introduced a one-parameter family of entropies and divergences. Shannon entropy and Kullback-Leibler divergence are recovered in the limiting case $\alpha \to 1$. In this work, we considered the maximum entropy problems associated with Renyi Q-entropies. We characterized the solutions for a standard mean constraint and for the generalized mean constraint of nonextensive statistics. We defined and discussed the entropy functionals as a function of the constraints. These entropies were characterized and various properties and relationships were highlighted. We also discussed numerical aspects. Finally, we illustrated this setting through some specific examples and recovered some well-known entropy functionals.

Future work will consider the extension of this setting to the multivariate case. An issue that should be examined is the fact that the direct multivariate extension of (5) is not separable in the case of a separable reference Q(x), which means that some dependences are implicitly introduced in the maximum entropy solution.

We also intend to investigate a possible underlying geometrical structure of the maximum entropy distributions (5). 
This structure should extend the geometrical structure of exponential families and involve the Bregman-like divergence 
introduced by (25). 

Finally, maximum entropy methods have been successfully employed for solving inverse problems. We intend to 
consider the potential of Renyi entropies and divergence in this field. A simple contribution would be to examine the 
interest of a Renyi entropy functional, e.g. (39), as a potential in a Markov field for image deconvolution or restoration. 

Appendix A. Proof of Theorem 1 

Let us begin with the classical constraint (C). In this first case, we follow the approach of [37]. Consider the functional Bregman divergence

$$B_h(f, g) = \int d\big(f(x), g(x)\big)\, h(x)\, dx,$$
where $h(x)$ is a nonnegative functional, associated to the (pointwise) Bregman divergence $d(f, g)$ built upon the strictly convex function $-x^{\alpha}$ for $\alpha \in (0, 1)$. Then

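$$B_{Q^{1-\alpha}}(P, P_C) = -\int_S \left[ P(x)^{\alpha} - P_C(x)^{\alpha} - \alpha \left( P(x) P_C(x)^{\alpha - 1} - P_C(x)^{\alpha} \right) \right] Q(x)^{1-\alpha}\, dx \tag{A.1}$$

$$= -\int_S P(x)^{\alpha} Q(x)^{1-\alpha}\, dx + \int_S P_C(x)^{\alpha} Q(x)^{1-\alpha}\, dx, \tag{A.2}$$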

with $h(x) = Q(x)^{1-\alpha}$ and where $S$ denotes the support of $P_C(x)$. The second line follows from the fact that when $P$ and $P_C$ have the same mean $\bar{x} = E_{P_C}[X] = E_P[X]$, then using the expression in (5) with $\nu = \xi = \frac{1}{\alpha - 1}$ it is possible to check that

$$\int_S P(x) P_C(x)^{\alpha - 1} Q(x)^{1-\alpha}\, dx = \int_S P_C(x)^{\alpha} Q(x)^{1-\alpha}\, dx = Z_\xi(\gamma, \bar{x})^{1-\alpha},$$

provided the whole support of $P(x)$ is included in $S$, which is the case by the absolute continuity of $P(x)$ with respect to $P_C(x)$.

The Bregman divergence $B_{Q^{1-\alpha}}(P, P_C)$ being always nonnegative and equal to zero if and only if $P = P_C$, the equality (A.2) implies that, for $\alpha \in (0, 1)$,

$$D_\alpha(P \| Q) \geq D_\alpha(P_C \| Q) \tag{A.3}$$

which means that $P_C$ is the distribution with minimum Renyi (Tsallis) divergence to $Q$, in the set of all distributions $P \ll P_C$ with a given mean $\bar{x}$, for $\alpha \in (0, 1)$. The case $\alpha > 1$ can be derived accordingly, beginning with the Bregman divergence associated to the strictly convex function $x^{\alpha}$.
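The identity (A.2) and the inequality (A.3) can be illustrated on a small discrete example. The sketch below is only an illustration with made-up numbers: a uniform reference on $\{0, 1, 2, 3\}$, $\alpha = 0.5$, and a member of the family written as $P_C \propto (1 + \beta x)^{\xi} Q$. This reparametrization, with $\beta = \gamma/(1 - \gamma \bar{x})$, and all variable names are assumptions made here for convenience. The competitor $P$ is obtained by perturbing $P_C$ along a direction that preserves both the total mass and the mean.

```python
import math

def renyi_divergence(p, r, alpha):
    """D_alpha(P||R) = log(sum_x P^alpha R^(1-alpha)) / (alpha - 1)."""
    s = sum(pi ** alpha * ri ** (1.0 - alpha) for pi, ri in zip(p, r))
    return math.log(s) / (alpha - 1.0)

def bregman(p, pc, q, alpha):
    """Discrete analogue of (A.1), built on -x^alpha with weight Q^(1-alpha)."""
    total = 0.0
    for pi, ci, qi in zip(p, pc, q):
        d = pi ** alpha - ci ** alpha - alpha * (pi * ci ** (alpha - 1.0) - ci ** alpha)
        total -= d * qi ** (1.0 - alpha)
    return total

alpha = 0.5
xi = 1.0 / (alpha - 1.0)                    # exponent nu = xi = -2
xs = [0, 1, 2, 3]
q = [0.25, 0.25, 0.25, 0.25]                # hypothetical uniform reference
beta = 0.5                                  # hypothetical family parameter
weights = [(1.0 + beta * x) ** xi * qi for x, qi in zip(xs, q)]
p_c = [w / sum(weights) for w in weights]

# Perturbation with zero total mass and zero first moment: same mean as p_c.
direction = [1.0, -2.0, 1.0, 0.0]
p = [ci + 0.02 * di for ci, di in zip(p_c, direction)]

s_p = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
s_c = sum(ci ** alpha * qi ** (1.0 - alpha) for ci, qi in zip(p_c, q))
breg = bregman(p, p_c, q, alpha)
```

Here `breg` agrees with `s_c - s_p` up to rounding, as in (A.2), and `renyi_divergence(p, q, alpha)` exceeds `renyi_divergence(p_c, q, alpha)`, as in (A.3).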

As far as the generalized mean constraint (G) is concerned, let us now consider the Renyi information divergence $D_\alpha(P \| P_G)$ from $P$ to $P_G$, with $P_G$ given in (5) with $\nu = -\xi = \frac{1}{1 - \alpha}$:

$$(\alpha - 1) D_\alpha(P \| P_G) = \log \int_S P(x)^{\alpha} P_G(x)^{1-\alpha}\, dx, \tag{A.4}$$

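with $S$ the support of $P_G(x)$, and which can be rearranged as

$$(\alpha - 1) D_\alpha(P \| P_G) = \log \int_S \left[\gamma(x - \bar{x}) + 1\right] \frac{P(x)^{\alpha} Q(x)^{1-\alpha}}{\int_S P(x)^{\alpha} Q(x)^{1-\alpha}\, dx}\, dx \tag{A.5}$$

$$+ \log \int_S P(x)^{\alpha} Q(x)^{1-\alpha}\, dx - (1 - \alpha) \log Z_{\frac{1}{1-\alpha}}(\gamma, \bar{x}). \tag{A.6}$$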

The generalized mean with respect to $P$ appears in the first term, and cancels if $P$ and $P_G$ have the same generalized mean $\bar{x}$ and $P_G \gg P$. In such a case, we obtain

$$D_\alpha(P \| P_G) = \frac{1}{\alpha - 1} \log \int_S P(x)^{\alpha} Q(x)^{1-\alpha}\, dx + \log Z_{\frac{1}{1-\alpha}}(\gamma, \bar{x}) \tag{A.7}$$

$$= D_\alpha(P \| Q) - D_\alpha(P_G \| Q), \tag{A.8}$$

where we used the fact that $D_\alpha(P_G \| Q) = -\log Z_{\frac{1}{1-\alpha}}(\gamma, \bar{x})$ as stated in Proposition 5. Since the Renyi information divergence is always greater than or equal to zero, we have

$$D_\alpha(P \| Q) \geq D_\alpha(P_G \| Q) \tag{A.9}$$

and conclude that $P_G$ is the distribution with minimum Renyi (Tsallis) divergence to $Q$, in the set of all distributions $P \ll P_G$ with a given generalized $\alpha$-mean $\bar{x}$.

Finally, it is easy to check, given the expression of $P_G$ and the fact that $\alpha \xi = \xi + 1$, that the generalized mean of $P_G$ is also the standard mean of the distribution with exponent $\nu = -(\xi + 1)$, that is $E^{(G)}_{P_G}[X] = E_{P_G^*}[X] = E_{P_{-(\xi+1)}}[X]$. Note that the equality in (A.8), $D_\alpha(P \| Q) = D_\alpha(P \| P_G) + D_\alpha(P_G \| Q)$, is a Pythagorean equality, which means that $P_G$ is the orthogonal projection of $P$ on the set of probability distributions with fixed generalized mean $\bar{x}$.
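The Pythagorean equality can be verified numerically on a toy example. In the sketch below, the generalized mean of a distribution $P$ is taken to be the ordinary mean of its escort distribution, proportional to $P^{\alpha} Q^{1-\alpha}$. A competitor $P$ with the same generalized mean as $P_G$ is built by perturbing the escort of $P_G$ and mapping back. The uniform reference, the parameter `beta` (a reparametrization $P_G \propto (1 + \beta x)^{1/(1-\alpha)} Q$) and all names are illustrative assumptions, not taken from the paper.

```python
import math

def normalize(v):
    total = sum(v)
    return [vi / total for vi in v]

def renyi_divergence(p, r, alpha):
    """D_alpha(P||R) = log(sum_x P^alpha R^(1-alpha)) / (alpha - 1)."""
    s = sum(pi ** alpha * ri ** (1.0 - alpha) for pi, ri in zip(p, r))
    return math.log(s) / (alpha - 1.0)

def escort(p, q, alpha):
    """Escort distribution proportional to P^alpha Q^(1-alpha)."""
    return normalize([pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q)])

alpha = 0.5
xs = [0, 1, 2, 3]
q = [0.25, 0.25, 0.25, 0.25]                # hypothetical uniform reference
beta = 0.5                                  # hypothetical family parameter
p_g = normalize([(1.0 + beta * x) ** (1.0 / (1.0 - alpha)) * qi
                 for x, qi in zip(xs, q)])

# Perturb the escort of P_G without changing its mass or its mean.
e_g = escort(p_g, q, alpha)
direction = [1.0, -2.0, 1.0, 0.0]
e_p = [ei + 0.02 * di for ei, di in zip(e_g, direction)]

# Map the perturbed escort back to a distribution P having that escort.
p = normalize([ei ** (1.0 / alpha) * qi ** ((alpha - 1.0) / alpha)
               for ei, qi in zip(e_p, q)])

lhs = renyi_divergence(p, q, alpha)
rhs = renyi_divergence(p, p_g, alpha) + renyi_divergence(p_g, q, alpha)
```

In this construction `lhs` and `rhs` coincide up to rounding, and since `renyi_divergence(p, p_g, alpha)` is nonnegative, `lhs` is at least `renyi_divergence(p_g, q, alpha)`, in line with (A.9).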
Appendix B. Proof of Proposition 6



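The exact behaviour depends on the reference distribution $Q(x)$ and on the sign of the exponent $\nu$. Because the domain of definition $\mathcal{D}$ might depend on $\gamma$, the derivative of the partition function reads

$$\frac{dZ_\nu(\gamma, \bar{x}_\gamma)}{d\gamma} = \lim_{\delta\gamma \to 0} \frac{1}{\delta\gamma} \left( Z_\nu(\gamma + \delta\gamma, \bar{x}_{\gamma + \delta\gamma}) - Z_\nu(\gamma, \bar{x}_\gamma) \right)$$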

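where $\bar{x}_\gamma$ and $\bar{x}_{\gamma + \delta\gamma}$ now denote the parameter $\bar{x}$ for distributions with parameter $\gamma$ and $\gamma + \delta\gamma$. Let us begin with the continuous case. If $\delta\mathcal{D}$ denotes the domain increment associated to the variation $\delta\gamma$, it remains

$$\frac{dZ_\nu(\gamma, \bar{x}_\gamma)}{d\gamma} = \int_{\mathcal{D}} \frac{d}{d\gamma} \left(1 + \gamma(x - \bar{x}_\gamma)\right)^{\nu} Q(x)\, dx \tag{B.1}$$

$$+ \lim_{\delta\gamma \to 0} \frac{1}{\delta\gamma} \int_{\delta\mathcal{D}} \left(1 + (\gamma + \delta\gamma)(x - \bar{x}_{\gamma + \delta\gamma})\right)^{\nu} Q(x)\, dx. \tag{B.2}$$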

Of course, when $\mathcal{D}$ does not depend on $\gamma$, we only have the first term, and it is easy to obtain (14). Otherwise, in order to satisfy the positivity of the integrand, the domain $\mathcal{D}$ is bounded above by $\bar{x}_\gamma - \frac{1}{\gamma}$ for $\gamma < 0$ and below by the same value for $\gamma > 0$. Then, the second integral, say $G$, can be expressed as

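$$G = \mathrm{sign}(\gamma) \int_{\bar{x}_{\gamma + \delta\gamma} - \frac{1}{\gamma + \delta\gamma}}^{\bar{x}_\gamma - \frac{1}{\gamma}} \left(1 + (\gamma + \delta\gamma)(x - \bar{x}_{\gamma + \delta\gamma})\right)^{\nu} Q(x)\, dx \tag{B.3}$$

$$= \frac{\mathrm{sign}(\gamma)}{\gamma + \delta\gamma} \int_0^{a} y^{\nu}\, Q\!\left(\frac{y - 1}{\gamma + \delta\gamma} + \bar{x}_{\gamma + \delta\gamma}\right) dy, \tag{B.4}$$

with $a = -(\gamma + \delta\gamma)(\bar{x}_{\gamma + \delta\gamma} - \bar{x}_\gamma) - \frac{\delta\gamma}{\gamma}$, that tends to zero with $\delta\gamma$ if $\bar{x}_\gamma$ is continuous. At first order, we then obtain

$$G = \mathrm{sign}(\gamma)\, \frac{Q\!\left(\bar{x}_{\gamma + \delta\gamma} - \frac{1}{\gamma + \delta\gamma}\right)}{\gamma + \delta\gamma} \int_0^{a} y^{\nu}\, dy \propto \frac{a^{1 + \nu}}{1 + \nu}$$

for $\nu > -1$. Then, it is readily checked that $\lim_{\delta\gamma \to 0} \frac{1}{\delta\gamma} G = 0$ for $\nu > 0$, so that (B.2) is always zero for $\nu > 0$ and (14) is true.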

In the discrete case, the partition function is 



$$Z_\nu(\gamma, \bar{x}_\gamma) = \sum_{x \in \mathcal{D}} \left(1 + \gamma(x - \bar{x}_\gamma)\right)^{\nu} Q(x).$$



There exist singular isolated values of $\gamma$ such that $1 + \gamma(x - \bar{x}_\gamma) = 0$, for $x$ integer. For such values, the corresponding term in the partition function diverges for $\nu < 0$. Contrary to the continuous case, where the domain of $\gamma$ is contiguous, the domain of values of $\gamma$ ensuring that the partition function is finite will be interrupted by isolated values of $\gamma$: the domain of possible $\gamma$ will consist of segments.

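As in the continuous case, the derivative of the partition function is the sum of two terms, the second one involving a domain increment:

$$\frac{dZ_\nu(\gamma, \bar{x}_\gamma)}{d\gamma} = \sum_{x \in \mathcal{D}} \frac{d}{d\gamma} \left(1 + \gamma(x - \bar{x}_\gamma)\right)^{\nu} Q(x) \tag{B.5}$$

$$+ \lim_{\delta\gamma \to 0} \frac{1}{\delta\gamma} \sum_{x \in \delta\mathcal{D}} \left(1 + (\gamma + \delta\gamma)(x - \bar{x}_{\gamma + \delta\gamma})\right)^{\nu} Q(x). \tag{B.6}$$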

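If $\mathcal{D}$ does not depend on $\gamma$, there is no domain increment and the derivative is given by (B.5). When the bounds of $\mathcal{D}$ depend on $\gamma$, the domain increment is given by the integers in the interval $\left( \lceil \bar{x}_{\gamma + \delta\gamma} - \frac{1}{\gamma + \delta\gamma} \rceil, \lceil \bar{x}_\gamma - \frac{1}{\gamma} \rceil \right)$ ($\gamma > 0$) or $\left( \lfloor \bar{x}_\gamma - \frac{1}{\gamma} \rfloor, \lfloor \bar{x}_{\gamma + \delta\gamma} - \frac{1}{\gamma + \delta\gamma} \rfloor \right)$ ($\gamma < 0$), where $\lfloor x \rfloor$ is the floor function that returns the largest integer less than or equal to $x$, and $\lceil x \rceil$ is the ceiling function, the smallest integer not less than $x$. If $\gamma$ belongs to some interval such that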
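the domain increment remains empty, then the derivative is of course simply (B.5). An extension will occur for an infinitesimal variation $\delta\gamma$ if $\bar{x}_\gamma - \frac{1}{\gamma}$ is precisely an integer, say $k$. Then, the second sum reduces to

$$G = \left(1 + (\gamma + \delta\gamma)(k - \bar{x}_{\gamma + \delta\gamma})\right)^{\nu} Q(k) \tag{B.7}$$

$$= \left(-\frac{\delta\gamma}{\gamma} - (\gamma + \delta\gamma)(\bar{x}_{\gamma + \delta\gamma} - \bar{x}_\gamma)\right)^{\nu} Q(k), \tag{B.8}$$

and finally

$$\lim_{\delta\gamma \to 0} \frac{1}{\delta\gamma} G = \lim_{\delta\gamma \to 0} \delta\gamma^{\nu - 1} \left(-(\gamma + \delta\gamma) \frac{\bar{x}_{\gamma + \delta\gamma} - \bar{x}_\gamma}{\delta\gamma} - \frac{1}{\gamma}\right)^{\nu} Q(k) = 0 \quad \text{for } \nu > 1, \tag{B.9}$$

since all terms in the parenthesis remain finite when $\delta\gamma \to 0$. In such a case the derivative reduces to (B.5) and (14) is true.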